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Abstract: This paper presents a novel phase unwrapping architecture for accelerating the 
computational speed of digital holographic microscopy (DHM). A fast Fourier transform 
(FFT) based phase unwrapping algorithm providing a minimum squared error solution is 
adopted for hardware implementation because of its simplicity and robustness to noise. 
The proposed architecture is realized in a pipeline fashion to maximize throughput of the 
computation. Moreover, the number of hardware multipliers and dividers are minimized 
to reduce the hardware costs. The proposed architecture is used as a custom user 
logic in a system on programmable chip (SOPC) for physical performance measurement. 
Experimental results reveal that the proposed architecture is effective for expediting the 
computational speed while consuming low hardware resources for designing an embedded 
DHM system. 

Keywords: phase unwrapping; digital holographic microscopy; FPGA; reconfigurable 
computing; system on programmable chip 
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1. Introduction 

Digital holographic microscopy (DHM) [1,2] is a highly effective means for non-invasively capturing 
the amplitude and phase data of an optically transparent or reflective specimen [3,4]. Digital holograms 
that contain information about the specimen can be recorded and obtained using a charge-coupled device 
(CCD) or a complementary metal oxide semiconductor (CMOS). However, the phase map derived from 
the reconstructed image of a digital hologram is non-linearly wrapped lying in the interval (-7r,7r]. 
To obtain the three-dimensional profile of a specimen, the wrapped phase map must be unwrapped to 
a continuous phase [5]. Such a phase unwrapping procedure is also important in other applications, 
including synthetic aperture radar interferometry (InSAR) [6] and magnetic resonance imaging 
(MRI) [7]. The phase unwrapping process may be performed offline in some of these applications, 
whose primary concern is the quality of the unwrapped phases. By contrast, for the DHM or electronic 
speckle pattern interferometry applications, in addition to accurate unwrapped phase reconstructions, 
fast phase unwrapping operation is desired [8-12] for attaining realtime video rendering with high frame 
rates. 

A simple raster scan algorithm is able to perform realtime phase unwrapping. However, in the 
presence of noise, the raster-scan algorithm may lead to an accumulation of error that eventually 
results in large deviations near the end of the accumulation. Popular approaches to the robust phase 
unwrapping include least square techniques, where the unwrapped phase is obtained as the function 
whose discrete gradient has the least squares deviation from its available estimate. The Poisson equations 
are then derived for the optimization, which can be solved by the preconditioned conjugate gradient 
(PCG) [13,14] and Gauss-Seidel techniques [13,15]. Recent advances in phase unwrapping include 
the ZnM algorithm [16], which solves the problem as a sequence of binary optimization. Network 
programming techniques [16] and graph cut techniques [17] are then used for the optimization. Another 
efficient approach is to use branch cut technique for phase unwrapping, which can be implemented 
by hybrid genetic algorithms [18]. The common drawback of these techniques are that they are all 
iterative algorithms. The number of iterations may be high [19] and may vary [16,20] depending on the 
input wrapped phase map. For a realtime DHM, phase unwrapping algorithms with low and constant 
computational complexities are usually desired for providing fast video rendering with constant frame 
rate. Consequently, these robust phase unwrapping algorithms may make the design of a realtime DHM 
difficult. Moreover, for the embedded systems with limited computational capacities, the implementation 
of realtime robust phase unwrapping becomes a very challenging issue. 

A number of implementations for fast phase unwrapping have been proposed [19-21]. The 
implementations presented in [20,21] are based on PCG [13,14] with field programmable gate array 
(FPGA) devices and graphic processing unit (GPU) [22] platforms, respectively. The implementation 
proposed in [19] is based on Gauss-Seidel techniques [13,15] with GPU platforms. Because both PCG 
and Gauss-Seidel techniques are iterative algorithms, only moderate acceleration is achieved. Moreover, 
some of these implementations are based on GPU, which may be difficult to be used for embedded 
devices due to high hardware cost and large power consumption. 

The goal of this paper is to present a novel phase unwrapping hardware architecture for accelerating 
the computational speed of DHM. The algorithm [23] selected for the proposed architecture is also a least 
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square technique for robust phase unwrapping. The algorithm is non-iterative, and therefore has constant 
computational complexity. As compared with its iterative least square counterparts [13-15], the selected 
algorithm is more computationally efficient [24,25]. In addition, it is based on fast Fourier transform 
(FFT), which can be efficiently implemented by hardware. Therefore, the algorithm is well-suited for 
the realtime DHM applications. 

Based on the algorithm [23], the architecture is separated into four units: the pre-transform unit, the 
FFT unit, the post-transform unit, and the on-chip memory. The first three units are used for computation 
of the phase unwrapping algorithm. The on-chip memory is used for storing the source data and the 
intermediate results produced by these units so that the memory access time can be reduced. Novel 
pipeline architectures are proposed for the implementation of the pre-transform unit, the FFT unit, and 
the post-transform unit to maximize the throughput of the proposed architecture. Only a single multiplier 
and divider are used in the FFT unit and post-transform unit for lowering hardware resource utilization, 
respectively. 

The proposed architecture has been implemented on FPGA devices so that it can operate in 
conjunction with a softcore CPU. Using the reconfigurable hardware, we are then able to construct a 
system on programmable chip (SOPC) system for the physical performance measurement for phase 
unwrapping in the embedded systems. As compared with its software counterpart running on Intel 1-7 
quad-core CPU, the proposed system has significantly lower computational time for phase unwrapping. 
In particular, when the image resolution is 513 x 513, the proposed system attains the speedup of 605 
over its software counterpart. All these facts demonstrate the effectiveness of the proposed architecture. 

2. The Phase Unwrapping Algorithm 

This section briefly reviews the algorithm adopted for the hardware phase unwrapping 
implementation. Please refer to [23] for detailed discussion of the algorithm. Let be the wrapped 
phase function of an unknown real-valued function faj for 0 < i < N 9 and 0 < j < N, where 
— 7r < Qj < 7T, and e j ^ j = e j ^. Let ^ for 0 < i < 2 AT, and 0 < j < 2N be the periodic extension 
of using the mirror reflection technique. That is, 



V;. 



dj for 0 < i < N, 0 < j < N, 

( 2N _ hJ for N < i < 2N, 0 < j < N, 

0,27v-j forO <i < N,N < j < 2N, 

(2N-i,2N-j for TV < i < 2N, N < j < 2N. 



Define 



A tj = ~ A lj = - (2) 

for all i and j. Note that these values must be computed as wrapped phase differences, where the values 
2tt or — 2tt will be added as necessary to ensure that Af ■ and Af ■ lie in the interval (— 7r, tt]. 

Let be an estimation of faj based on Qj. The goal of the phase unwrapping algorithm is to find 
(j>ij minimizing 

- hi - + Efc+i - hi - A h) 2 - ( 3 ) 

h3 h3 
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It can be shown that the optimal faj is the solution to 




(4) 



where 



• = A x . - A x 



+ 




(5) 



Because both faj and 7^ are periodic, the Fourier transform can be used to solve Equation (4). Applying 
the 2N x 2N two-dimensional Fourier transforms to both sides of Equation (4) yields 



where & m ,n and r m?7l are the Fourier transforms of faj and 7^ , respectively. The function is obtained 
by the inverse Fourier transform to Equation (6). The estimated faj is then obtained by restricting the 
results to the grid defined by 0 < i < N, 0 < j < N. 

Based on the discussions shown above, the phase unwrapping algorithm using the FFT is summarized 
as follows: 

Step 0: Suppose Qj, 0 < i < N 9 0 < j < N, are given. 

Step 1: Compute 7^, 0 < i < N, 0 < j < N, using Equation (5). 

Step 2: Compute T m ^ 0 < i < 2N, 0 < j < 2N, 



Step 2.1 For each row i of the array 7^ , 

compute Xij, 0 < j < 2N, using mirror reflection. 

That is, Xij 7^ for 0 < j < N, and Xij Ji,2N-j for N < j < 2N. 
Step 2.2 Compute A^ n , 0 < n< 2N, the FFT of X id , 0<j< 2N. 
Step 2.3 Replace i-th row of the array 7^ by A ij7l with the restriction that 0 < n < N. 
Step 2.4 After all of the rows are processed in this way, 

repeat the process (Step 2.1-2.3) on columns. 
Step 3: Compute § m , n using Equation (6). 
Step 4: Compute the inverse FFT of Q m ,n to obtain faj. 

3. The Proposed Architecture 

Figure 1 shows the proposed architecture for the FFT-based phase unwrapping algorithm. As shown 
in the figure, the proposed architecture can be separated into four units: the pre-transform unit, the FFT 
unit, the post-transform unit, and the on-chip memory. Given Qj, 0 < i < N, 0 < j < N, the goal of 
the pre-transform unit is to compute 7^ . The FFT unit is then adopted for computing r m n . After that, 
the post-transform unit is used for calculating § m ^ n . Finally, the FFT unit is used again for computing 
(f>ij based on § m ^ n . The on-chip memory is used for storing both the original data and the intermediate 
and final results of the pre-transform unit, the FFT unit and the post-transform unit. Storing the original 
data and intermediate results in the on-chip memory effectively reduces the memory access time for the 
algorithm. 



r m;n /(2cos(7rm/M) + 2cos(7rn/iV) - 4), 



(6) 



using two-dimensional 2N x 2N fast Fourier transform (2D-FFT). 
The 2D-FFT operates as follows. 
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Figure 1. The proposed architecture for phase unwrapping. 
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3.1. On-Chip Memory 

The on-chip memory consists of two identical RAM modules. Each RAM module is able to store an 
(N+l) x (N+l) array. The RAM modules are shared by all the units in the proposed architecture. They 
are used to store the original or intermediate results produced by each unit. These results will then be 
used as the source data for subsequent operations. The employment of the on-chip memory is able to 
significantly reduce the memory access time for phase unwrapping. 

3.2. Pre-Transform Unit 

The goal of the pre-transform unit is to implement Step 1 of the algorithm in hardware. Figure 2 
shows the architecture of the pre-transform unit, which consists of controller, address generator, registers, 
adders, phase wrapping unit, and multiplexer. The address generator is used to generate addresses 
for reading the source data from the on-chip memory, and writing the results to the on-chip memory. 
The registers are used for the implementation of pipeline for enhancing the throughput. As shown in 
Figure 2, there are three stages in the pipeline. The first and the third stages are adders. The second stage 
is the phase wrapping unit, which is employed to ensure that the values of Af ■ and Af j lie in the interval 

(-7T,7T]. 

Figure 3 depicts the architecture of the phase wrapping unit, which contains 2Q + 1 modules. Each 
module z, i = 1, ...,2(5 + 1, contains two comparators for determining whether a phase difference 
computed by Equation (2) lies in the interval (— (2Q + l)7r + 2(i — 1)tt 9 — {2Q + l)7r + 2m\. The phase 
difference is first broadcasted to all the modules. After the interval in which the phase difference lies 
is identified, the phase wrapping operation is then performed accordingly so that the resulting wrapped 
phase difference lies in (— 7r, tt}. 

The source data for the pre-transform operations, Qj, 0 < i < TV, 0 < j < TV, are stored in the 
on-chip RAM 1. The pre-transform unit then produces 7^, 0 < i < N, 0 < j < N, and stores them 
in both the on-chip RAM 1 and RAM 2. The pre-transform unit computes 7^ in two steps. At the first 
step, is retrieved from the on-chip RAM 1 to compute Af ■ — Af_ l j -, which will then be stored in 



Sensors 2011, 11 



9165 



the on-chip RAM 2. Figure 4 shows the timing diagram for the pipeline operation of the pre-transform 
unit at the first step. Figure 5 reveals the input/output to each stage of the pipeline for the shaded time 
interval marked in the Figure 4. 

Figure 2. The architecture of pre-transform unit, where REG and MUX are the abbreviations 
of register and multiplexer, respectively. 



Control signal from/to CPU-*-- 



Controller 



Address Generator 



Address for reading from On-Chip RAM 
Address for writing to On-Chip RAM 



Data from 
On-thip RAM 



•—►REG 



Stage 1 
(Addition) 



REG 



Phase Wrapping 
Module 




Stage2 
(Phase Wrapping) 



Stage3 
(Addition) 



At the second step, is retrieved again from the on-chip RAM 1 to compute Af ■ — Af J -_ 1 . 
In addition, Af ■ — A^_ Xj is retrieved from the on-chip RAM 2. The summation of 
(Afj — Af J _ 1 ) and (Af ■ — Af_ x ■) forms 7^, which will then be stored back to the on-chip RAM 1 and 
RAM 2 for the subsequent FFT operations. Figure 6 shows the timing diagram for the pipeline operation 
of the pre-transform unit at the second step. The input/output to each stage of the pipeline for the shaded 
time interval marked in the Figure 6 is then depicted in Figure 7. Note that, at the final stage of the 
pipeline, the MSBs (most significant bits) and LSBs (least significant bits) of 7^ are stored in on-chip 
RAM 1 and RAM 2, respectively. By storing the results to two modules, the computation precision is 
then doubled for subsequent operations. 

Note that, as shown in Figures 5 and 7, the input to the pre-transform circuit is ^ j, which is the 
mirror reflected version of Qj in accordance with Equation (1). From Equations (2)-(5), it follows 
that the computation of 7^ is based on i/jij instead of Qj. When 0 < i < N, 0 < j < N, because 
ipij = Qj, the Qj stored in on-chip RAM 1 is used as ipij. Otherwise, ^ should be computed using 
Equation (1). It can be easily shown that only i/j-ij 9 ipN+i,j> ^i-u an d i/Ji,N+u 0<i<N,0<j<N 
actually require mirror reflection for the computation of 7 0j , Jnj, 7i,o an d 7z,at. Using Equation (1), it 
follows that ift-ij C1J9 i>N+i,j Cn-ij, ^%-\ Ci,i> an ^ ^i,N+i d,N-u it is not necessary to 
design a circuit for mirror reflection for the pre-transform unit. We only have to reconfigure the address 
generator in the unit so that when i/j-ij, ipN+i,j, or ^i,A^+i desired, the address of Cn-ij* 
0,i or d,N-i will be delivered to on-chip RAM 1, respectively. 
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Figure 3. (a) The architecture of phase- wrapping unit in the pre-computation unit, (b) The 
architecture of each module K in the phase-wrapping unit. 
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Figure 4. The timing diagram for pipeline operation at the first step of pre-transform unit. 
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Figure 5. The input/output to each stage of the pipeline for the shaded time interval marked 
in the Figure 4. 
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Figure 6. The timing diagram for pipeline operation at the second step of pre-transform unit. 
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Another advantage of the employment of address generator is that it is able to generate multiple 
addresses for the concurrent read and write accesses of the on-chip memory. Multiple address generation 
is essential for the implementation of the pipeline in the pre-transform unit. For the shaded time interval 
indicated in Figure 7, the retrieval of ipi+ij and Af ._ 2 — Af -_ 3 are required at stages 1 and 3, respectively. 
In addition, the computation result at stage 3, 7^-2 should also be written to the RAM 1 and RAM 2. 
As shown in Figure 7, the address generator sends three different addresses to the on-chip memory for 
the concurrent access: address for reading i/ii+ij from RAM 1, address for reading Af -_ 2 — -_ 3 from 
RAM 2, and address for writing 7«-2,j to RAM 1 and RAM 2. Other alternatives for memory accesses 
are based on CPU or direct memory access (DMA). However, because there is only one memory access 
at a time, using the CPU or DMA-based memory accesses for the proposed pipeline architecture may be 
difficult. 



Sensors 2011, 11 



9168 



Figure 7. The input/output to each stage of the pipeline for the shaded time interval marked 
in the Figure 6. 
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3.3. FFT Unit 
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The FFT unit is employed for implementing Steps 2 and 4 of the algorithm. Figure 8 shows the 
architecture of the FFT unit, which contains controller, address generator, mirror reflection module and 
one-dimensional FFT (1D-FFT) module. The FFT unit reads the source data from both on-chip RAM 1 
and RAM 2. The results produced by the FFT unit are then stored back to on-chip RAM 1 and RAM 2 
to replace the source data. 

Figure 8. The architecture of FFT unit. 
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In the FFT unit, each row of 7^ is loaded from the on-chip memory one at a time. The FFT unit then 
writes the computational results directly back to the same row in the on-chip memory. After the row 
operations are completed, the column operations will proceed in the same manner. After the completion 
of all the column operations, the array stored in the on-chip RAM is r m n , the two-dimensional FFT 
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From Step 2.1 of the phase unwrapping algorithm, it follows that the mirror reflection is required 
before the 1D-FFT transform (or inverse transform). The mirror reflection module is a 2iV-stage shift 
register. The input to the shift register is located at the stage N + 1. The first N + 1 stages of the shift 
register are able to operate in two directions for the implementation of mirror reflection. The output of 
the shift register is connected to the 1D-FFT module. The proposed module operates in two phases using 
the shift register. At the first phase, each data point entering the mirror reflection module is shifted in 
two directions. After all the data points in a row have entered the shift register, all the data points are 
shifted right one at a time to the output at the second phase. 

We use Altera FFT MegaCore function [26] to implement the 1D-FFT module. The transform 
length of the FFT is 2N. The 1D-FFT module has single data input and single data output. The 
input/output dataflow of the module operates in streaming mode, allowing the continuous process of 
input data stream, as well as producing the continuous output data stream. In addition, the module 
contains only one butterfly processor. Therefore, it requires only a single complex multiplier for the FFT 
implementation [26]. The area cost can then be minimized. 

The FFT unit is able to operate as a two-stage pipeline, where the first stage is mirror reflection, and 
the second stage is 1D-FFT. Figure 9 shows the timing diagram for the pipeline operation of the FFT 
unit for the rows of the array 7^. Note that the operation of each stage of FFT unit is separated into 
two phases. Both the stages will operate at the same phase at the same time. Figures 10 and 11 show 
the operation of phase 1 and phase 2 at each stage, respectively. As shown in Figure 10, at the phase 1 
of the mirror reflection, a row of 7^ (e.g., i-th row) is loaded from on-chip RAM for computing A^-, 
0 < j < 2N. At the same time, the phase 1 of 1D-FFT module uses Xi-ij, 0 < j < 2N as the input 
for computing A^_i n , 0 < n < 2N. The mirror reflection module then delivers A^-, 0 < j < 2N to 
the 1D-FFT module at its phase 2 operation. At the same time, the 1D-FFT module sends the A^_i ?n , 
0 < n < N, to the on-chip RAM, as depicted in Figure 11. 

Note that the FFT unit is also used for the computation of inverse 2D-FFT of 4> mjn . The data stored 
in the on-chip memory is & m ,n- The 1D-FFT module will operate as the ID inverse FFT for the input 
data. The FFT unit will then produce to the on-chip memory. 

Figure 9. The timing diagram for pipeline operation of FFT unit. 
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Figure 10. The operation of phase 1 at each stage of the pipeline for the shaded time interval 
marked in the Figure 9. 
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Figure 11. The operation of phase 2 at each stage of the pipeline for the shaded time interval 
marked in the Figure 9. 
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3.4. Post-Transform Unit 



(1D-FFT) 



The post-transform unit is used for the hardware computation of Step 3 in the algorithm. Therefore, 
the objective of the post-transform unit is to realize Equation (6) in hardware. Figure 12 shows the 
architecture of the post-transform unit, which consists of two cosine computation modules, a divider, and 
adders. The goal of the two cosine computation modules is to compute cos(7rm/Af), m = 0, N — 1, 
and cos (7m/ AT), n = 0, N — 1, respectively. Since m and n only takes possible values for the 
cosine transform, the two modules can be implemented as a simple look-up table (LUT), consisting 
of entries. The i-th entry of the table in the two cosine computation modules contains the value of 
cos(7ri/N). 
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Figure 12. The architecture of post-transform unit. 
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Although LUTs can be used for the implementation of cosine modules, they may be difficult to be used 
for the design of divider in the post-transform unit. From Equation (6), we can see that the nominator and 
denominator of the divider are all real numbers. Consequently, the number of entries of the LUT will be 
very high when it is used for divider design. A general divider which accepts real numbers as the inputs 
is then desired. The divider in the architecture is implemented by Altera Floating Point Megafunction 
ALTFP_DIV [27]. The divider is separated into p pipeline stages to enhance the throughput of the 
post-transform unit. 

Given r m n in the on-chip memory, the post-transform unit operates as follows. The unit loads 
the FFT coefficients r m?n from the on-chip memory one at a time based on the raster scan order. To 
reduce the amount of bus traffic, the address delivered to the on-chip memory for loading r m?7l is also 
delivered to the post-transform unit for extracting the indices m and n, which are then delivered to the 
cosine computation modules for finding cos(7rm/7V) and cos(7m/AT). Both r m?n loaded from the on-chip 
memory and (2 cos(7rm/7V) + 2 cos(7m/AT) — 4) computed by the adders are then used as the input to 
the divider for computing $ m5n . The output of the divider is then stored directly back to the on-chip 
memory. 

The post-transform unit is implemented as a (2 + p) -stage pipeline. As shown in Figure 12, the first 
stage performs the address to index conversion. That is, the address used for retrieving r m?7l from the 
on-chip RAM is used for computing indices m and n at this stage. The second stage of the pipeline 
computes cos(7rm/7V) and cos (7m/ AT) based on table look-up method. In addition, the second stage 
consists of an adder for computing (2 cos(7rm/7V) + 2 cos(7m/AT) — 4). The third stage to the (2 + p)-th 
stage of the pipeline form a p-stage divider for computing § m ^ n . Figure 15 shows the timing diagram for 
the pipeline operation of the post-transform unit. The input/output to each stage of the pipeline for the 
shaded time interval marked in the Figure 13 is then depicted in Figure 14. 

The major advantage of the design is that only the r m n is required from the input ports. The terms 
cos(7rm/7V) and cos (jrn/ N) can be obtained from the address for retrieving r m?n from the on-chip 
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RAM. Based on the address, the computation of cos(7rm/7V) and cos(7m/AT) are then carried out by 
simple LUT. Consequently, the single address for retrieving r m?n actually produces r m n , cos(7rm/7V) 
and cos (7m/ AT) for computing <& m?n using Equation (6). This single-address-multiple-data scheme is 
beneficial for enhancing the computational speed of the post-transform unit. 

Figure 13. The timing diagram for pipeline operation at post-transform unit for p = 2. 



Computation of 



Computation of 
<t> 1 



Computation of 



Address to ^ 

Index 
Conversion 



m,n 



Computation of 



Time 



Table lookup 
& 

Addition 



Division 



(m,n-2) 



2cos(nm/N) + 
2 cosQr(n-2)/AQ- 4 



m,n-2 



Address to 

Index 
Conversion 



Table lookup 
& 

Addition 



Division 



(m,n-\) 



2cos(nm/N) + 
2c os(7r(n- 1)/7V) -4 ! 



ft 



m,n-l 



Address to 

Index 
Conversion 



Table lookup 
& 

Addition 



Division 



(m,n) 



2cos(nm/N)+\ 
2cos(7rn/AQ-4 l 



Address to 

Index 
Conversion 



Table lookup 
& 

Addition 



{m,n+\) 



2cos(7rm/Af) + 
2cos(7r(n + l)/iV)-4 



0 



Division 



Figure 14. The input/output to each stage of the pipeline for the shaded time interval marked 
in the Figure 13. 
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3.5. Analysis of Area Costs and Speed 



Two types of performance are considered in this paper: the latency and area complexities. The latency 
of each unit is defined as the time required for finishing the operations of that unit. Because the arithmetic 
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operators and storage cells are the basic building blocks for the architecture, the area complexities are 
separated into 2 categories: the number of arithmetic operators, and the number of storage cells. The 
arithmetic operators consist of adders, multipliers, and dividers. The storage cells contain registers, 
ROM cells and RAM cells. 

Table 1 summarizes the area complexities and latency of the proposed architecture. Recall that the 
images considered in the proposed architecture are of image resolution (N+l) x (N+l). It can then be 
observed from Table 1 that the number of arithmetic operators in the proposed architecture is independent 
of the image resolution. In addition, the number of storage cells grows linearly with the image resolution. 



Table 1. Area complexities and latency of the proposed architecture with respect to the 
image resolution (N+l) x (N+l), where the function O, termed big O function, is used to 
indicate the asymptotic complexities of the architecture. 





Pre- 


FFT 


Post- 


On-Chip 


Overall 




Transform 




Transform 


Memory 




Arithmetic Operators 


0(1) 


0(1) 


0(1) 


0 


0(1) 


Storage Cells 


0(1) 


0(N) 


0(N) 


0{N 2 ) 


0{N 2 ) 


Latency 


0(N 2 ) 


0(N 2 log TV) 


0(N 2 ) 




0(N 2 log TV) 



The number of arithmetic operators is independent of the image resolution because each unit in the 
proposed architecture only uses a fixed number of adders, multipliers and/or dividers, independent of 
N. For example, the FFT unit only employs one complex multiplier in the ID FFT module. The 
post-transform unit also adopts only one divider. 

The number of storage cells used by the proposed architecture increases with the image resolution. 
For the FFT unit, because the mirror reflection module contains 2N registers, its complexity is 0(N). 
For the post-transform unit, the size of tables for cosine transform is also proportional to N. The number 
of storage cells in the on-chip RAMI and RAM2 is (N+l) x (N+l). Its complexity therefore is 0(N 2 ), 
and grows linearly with the image resolution. 

To evaluate the time complexity, we first note that the pre-transform and post-transform units need to 
perform additions and division to each of the (N + 1) x (N + l) input data points, respectively. Since 
the number of adders and dividers are independent of N, the latency of these units is 0(N 2 ). For the 2D 
FFT and 2D IFFT operations, the latency is given by 0(N 2 logN). 

3.6. Software-Hardware Co-Design 

The proposed architecture is used as a custom user logic in a SOPC system consisting of softcore 
NIOS CPU [28], and the proposed architecture, as depicted in Figure 15. 

The objective of NIOS CPU is to control the data flow of the proposed architecture. Note that the 
on-chip RAM in the proposed architecture provides the source data for pre-transform unit, FFT unit and 
post-transform unit. The on-chip RAM also stores the computation results from these units. To ensure 
that the data in on-chip RAM is delivered to the correct unit, and the computation results of each unit can 
be sent to the on-chip RAM, the CPU is responsible for activating controller at each unit, and specifying 
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the proper value in the status register in the on-chip RAM, which controls the multiplexer in the read and 
write ports of the memory. 

Figure 15. The SOPC system for phase unwrapping. 
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Figure 16. The flowchart of the software executing by the CPU. 
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As shown in Figure 15, the proposed architecture is connected to the Avalon bus for the 
communication with the NIOS CPU. A special interface circuit is designed for the connection. The 
interface circuit contains registers, address decoders and multiplexers so that the on-chip memory and 
the status registers in the controllers of all the units in the proposed architecture can be accessed by the 
NIOS CPU. 

Figure 16 shows the flowchart of the software executed by the CPU. It can be observed from 
Figure 16 that the software only involves the controller activation, as well as specifying the contents 
of status register in the on-chip RAM. Because of the simplicity of the software, the execution of the 
phase-unwrapping algorithm can be significantly enhanced. 

4. Experimental Results 

This section presents some experimental results of the proposed architecture. The design platform is 
Altera Quartus II with SOPC Builder and NIOS II IDE. The target FPGA device for the hardware design 
is Altera Stratix III EP3SL150. 

The hardware resource utilization of each unit in the proposed architecture are revealed in 
Tables 2-4. The images resolutions considered in these tables are 129 x 129 (i.e., N = 128), 
257 x 257 (i.e., N = 256) and 513 x 513 (i.e., N = 512). There are three types of area costs considered 
in the experiment: adaptive logic modules (ALMs), embedded memory bits, digital signal processing 
(DSP) blocks. The ALMs are used for the implementations of arithmetic operators and storage cells. The 
embedded memory bits are used mainly for storage cells. The DSP blocks are used only for arithmetic 
operators such as dividers and multipliers. The target FPGA device Altera Stratix III EP3SL150 contains 
56,800 ALMs, 5,630,976 embedded memory bits, and 384 DSP blocks. 

It can be observed from Tables 2-4 that both the pre-transform and post-transform unit utilize 
only a small percentage of ALMs, embedded memory bits, and DSP blocks. The ALM utilization of 
post-transform unit grows with image resolution because the tables in the cosine computation modules 
are implemented by ALMs. The divider in the post-transform module is implemented by DSP blocks. 
The major hardware resource utilized by on-chip memory is the embedded memory bits, which grows 
linearly with TV 2 . The FFT unit utilizes most of the ALMs, which are used for the design of mirror 
reflection module and ID FFT module. The FFT unit uses DSP block only for the implementation of 
complex multipliers. Since there is only one complex multiplier in the ID FFT module, independent 
of image resolutions. The DSP block utilization of FFT unit is therefore also independent of image 
resolutions. 



Table 2. The ALM utilization of each unit in the proposed architecture for various image 
resolutions. 



Image Resolutions 


Pre-Transform 


FFT 


Post-Transform 


On-Chip Memory 


129 x 129 


197 


8,358 


889 


136 


257 x 257 


201 


10,532 


959 


178 


513 x 513 


221 


19,049 


1,641 


301 
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Table 3. The embedded memory bit utilization of each unit in the proposed architecture for 
various image resolutions. 



Image Resolutions 


Pre-Transform 


FFT 


Post-Transform 


On-Chip Memory 


129 x 129 


0 


61,440 


4,608 


299,538 


257 x 257 


0 


122,880 


4,608 


1,188,882 


513 x 513 


0 


233,472 


4,608 


4,737,042 



Table 4. The DSP block utilization of each unit in the proposed architecture for various 
image resolutions. 



Image Resolutions 


Pre-Transform 


FFT 


Post-Transform 


On-Chip Memory 


129 x 129 


0 


24 


16 


0 


257 x 257 


0 


24 


16 


0 


513 x 513 


0 


24 


16 


0 



Table 5 shows the total hardware resource utilization of the proposed architecture for images with 
different resolutions. It can be observed from the table that the increase in the ALM utilization is small 
when image resolution enlarges. In fact, as the image resolution increases 16 folds (i.e., from 129 x 129 
to 513 x 513), the ALM utilization is only increased approximately 2 folds (i.e., from 9,580 to 2,1212). 
The embedded memory utilization grows linearly with the image resolution. The DSP block utilization 
is independent of the image resolution. In addition to the area complexities of the entire architecture, the 
table consists of the area complexities of the entire SOPC using the proposed architecture as a custom 
user logic. 



Table 5. The total area costs of the proposed architecture for various image resolutions. 







Proposed Arch. 






Entire SOPC 




Image 


ALMs 


Embedded 


DSP 


ALMs 


Embedded 


DSP 


Resolutions 




Memory Bits 


Blocks 




Memory Bits 


Blocks 


129 x 129 


9,580/56,800 


365,586/5,630,976 


40/384 


17,081/56,800 


968,722/5,630,976 


44/384 




(17%) 


(6%) 


(10%) 


(30%) 


(17%) 


(11%) 


257 x 257 


11,870/56,800 


1,316,370/5,630,976 


40/384 


20,905/56,800 


1,916,434/5,630,976 


44/384 




(21%) 


(23%) 


(10%) 


(36%) 


(34%) 


(11%) 


513 x 513 


21,212/56,800 


4,975,122/5,630,976 


40/384 


28,568/56,800 


5,085,778/5,630,976 


44/384 




(37%) 


(88%) 


(10%) 


(50%) 


(90%) 


(11%) 



The execution time of each step of phase unwrapping algorithm implemented by the proposed 
architecture for various image resolutions is shown in Table 6. The execution time is measured by the 
SOPC platform with 500 MHz NIOS softcore CPU. The proposed architecture is used as the accelerator, 
as shown in Figure 17. It can be observed from the table that the FFT and IFFT are the most time 
consuming operations. In particular, when image resolution is 513 x 513, both the FFT and IFFT 
consume 71.9 % of the total computational time. These results are consistent with those shown in 
Table 1, where the FFT unit has the highest time complexity. 
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Table 6. The execution time of the proposed phase unwrapping architecture for various 
image resolutions. 



Image Resolutions 


Pre- 
Transform 


FFT 


Post- 
Transform 


Inverse 
FFT 


Total 


129 x 129 
257 x 257 
513 x 513 


0.1 (ms) 
0.3 (ms) 
1.1 (ms) 


0.3 (ms) 
0.9 (ms) 
3.2 (ms) 


0.2 (ms) 
0.4 (ms) 
1.4 (ms) 


0.3 (ms) 
0.9 (ms) 
3.2 (ms) 


0.9 (ms) 
2.5 (ms) 
8.9 (ms) 



Figure 17. The phase unwrapping results for 257 x 257 images: (a) The phase wrapped 
image, (b) The phase unwrapped image produced by the proposed architecture. 




(b) 

Although the division and cosine operations are required in the post-transform unit, the computation 
time is low and comparable to that of the pre-transform unit. For a513 x 513 image, the post-transform 
unit consumes only 15.7% of the total computation time. The fast computation is due to the efficient 
single-address-multiple-data operations as revealed in Figures 14 and 15. The single address for 
retrieving r m n actually produces r m n , cos(yTm//V) and cos (7m/ AT) for computing $ m n in a pipeline 
fashion. As a result, the architecture is able to minimize number of memory accesses while maintaining 
high throughput. 

Table 7 shows the speed of the proposed phase unwrapping architecture for images with various 
resolutions. The speed of its Matlab software counterpart running on the 2.8 GHz Intel 17 quad-core 
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processor with 4 GB DDR III is also included in the table for comparison purpose. We can see from 
Table 7 that the proposed hardware circuit operates at significantly faster speed. The speedup over its 
software counterpart grows with the image resolutions. As the image resolution reaches 513 x 513, 
the speedup becomes 605. Note that the total computation time of phase unwrapping for 513 x 513 
images is only 8.9 ms. This implies that the proposed architecture can support rendering with frame rate 
100 fps. The proposed architecture therefore can be effectively employed for embedded DHM requiring 
rendering with high frame rate. 

Table 7. The execution time of the proposed phase unwrapping architecture for various 
image resolutions. 



Image 


Proposed 


Software 


Speedup 


Resolutions 


Architecture 


Counterpart 




129 x 129 


0.9 (ms) 


468 (ms) 


585 


257 x 257 


2.5 (ms) 


1504 (ms) 


601 


513 x 513 


8.9 (ms) 


5389 (ms) 


605 



Table 8 lists the computation time of various existing phase unwrapping implementations. The 
direction comparisons of these implementations may be difficult because these implementations are 
based on different phase unwrapping algorithms with different image resolutions. In addition, these 
implementations are realized by different platforms. Nevertheless, we can still see from Table 8 that the 
proposed is an effective alternative for phase unwrapping when the computation time is an important 
concern. 

Table 8. The execution time of different phase unwrapping implementations. 



Implementations Computation 

Time 



Image Resolutions 



Platforms 



Proposed 
Architecture 
[19] 

[21] 

[20] 



8.9 (ms) 
672 (ms) 
2.8 (s) 
24.7 (s) 



513x 513 
512x 512 
640 x 480 
l,024x 512 



FPGA (Altera Stratix III 

EP3SL150) 
GPU (NVIDIA Geforce 

8800GTX) 
GPU (NVIDIA Geforce 

8800GTX) 
FPGA (Xilinx Vertex II 
Pro) 



Figures 17 and 18 show the phase unwrapping results of the proposed architecture. The images 
considered in the experiments are produced by the DHMs in the IOP lab at the Institute of Electro-Optical 
Science and Technology, National Taiwan Normal University. In the experiments, microlens made by 
SUSSA (with radius of curvature 120 microns) are tested. The image resolution is 257 x 257 and 
513 x 513 for the images shown in Figures 17 and 18, respectively. To evaluate the accuracy of 
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the reconstruction, the radius of the curvature in the phase unwrapped results are measured, and are 
compared with the actual radius of the curvature of the microlens. The measured radius of curvature 
of the two microlens in Figure 17 are 121.0 and 121.1 microns, respectively. The measured radius of 
curvature of the microlen in Figure 18 is 119.7 microns. The maximum error of the unwrapped results 
therefore is only 1.1 microns for the measurement of radius of curvature. 

Figure 18. The phase unwrapping results for 513 x 513 image: (a) The phase wrapped 
image, (b) The phase unwrapped image produced by the proposed architecture. 




y-axis [^im] 0 0 x-axis [jum] 



(b) 

5. Concluding Remarks 

The proposed architecture has been found to be effective for phase unwrapping. It utilizes low 
hardware resources. Only a single divider and complex multiplier is used in the architecture. The 
utilization of DSP blocks therefore is minimized. The ALM and memory bits utilization also only grow 
linearly with the image resolutions. Each unit in the architecture is implemented in a pipeline fashion 
for enhancing the throughput. The architecture therefore has fast computation speed. In particular, when 
the image resolution is 513 x 513, the computation time is only 8.1 ms. The speedup attains 605 over 
its software counterpart. The architecture is able to support frame rate above 100 fps for embedded 
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DHM rendering. The architecture is an effective alternative for the implementation of embedded DHM 
systems where low hardware resource utilization, high image resolution and high image rendering rate 
are desired. 
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